Got R? Catch ’em all!

Objective

3 Web scraping methods

  • We’re gonna do 3 different web scraping tasks in 5 minutes, all from a single site
  1. Scrape a table of original 151 Pokemon stats from one webpage
  2. Scrape 151 images of Pokemon from 151 separate webpages
  3. Build a plot that scrapes the .pngs of each Pokemon from 151 webpages by itself

starters

Pokeball - Rvest Table Scraping

Pluck the data from Bulbapedia

bulbapedia

Libraries and Data

library(tidyverse)
library(xml2)
library(rvest)
library(stringr)

bulbagarden <- "http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_base_stats_(Generation_I)"

PokedexR

baseStats <-
  read_html(x = bulbagarden) %>%
  html_node(css = "div table") %>%
  html_table() 

SelectorGadget (like a silph scope but for web pages)

  • rvest recommends SelectorGadget, a Chrome extension for generating CSS selectors.
  • It “Makes the Invisible Plain to See!” by showing which parts of the HTML correspond to which bits of the user-facing webpage.
  • However, I had to fudge it a bit, as it didn’t pick up the table properly.
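
The selector fudging mentioned above can be tried offline. A minimal sketch, assuming a made-up two-table page (the ids nav and stats are hypothetical, not Bulbapedia's markup): html_node() returns only the first match for a selector, so a loose selector such as "div table" can land on the wrong table, while html_nodes() returns every match.

```r
library(rvest)

# Hypothetical stand-in page with two tables; not Bulbapedia's real markup.
page <- read_html('
  <html><body>
    <div><table id="nav"><tr><th>Menu</th></tr><tr><td>Home</td></tr></table></div>
    <div><table id="stats">
      <tr><th>#</th><th>Pokemon</th><th>HP</th></tr>
      <tr><td>1</td><td>Bulbasaur</td><td>45</td></tr>
      <tr><td>2</td><td>Ivysaur</td><td>60</td></tr>
    </table></div>
  </body></html>
')

# html_node() stops at the first match, which here is the navigation table
first_id <- html_attr(html_node(page, css = "div table"), "id")

# html_nodes() returns every match, handy for seeing what a selector really hits
all_tables <- html_nodes(page, css = "div table")

# A tighter selector picks out the table we actually want
stats <- html_table(html_node(page, css = "table#stats"))
```

Counting the matches with length(all_tables) is a quick way to tell whether a selector needs tightening before it is handed to html_table().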

All the data

head(baseStats, n = 3)
##   #      Pokémon HP Attack Defense Speed Special Total Average
## 1 1 NA Bulbasaur 45     49      49    45      65   253    50.6
## 2 2 NA   Ivysaur 60     62      63    60      80   325    65.0
## 3 3 NA  Venusaur 80     82      83    80     100   425    85.0

Muky Data

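# Drop the empty second column (all NA in the output above), then rename
# the two awkward headers: `#` and the accented Pokémon column.
baseStats %>%
  select(-2) %>%
  rename(DexNo = `#`,
         Pokemon = !!names(.[2])) -> baseStats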

Greatball - Rvest image scraping

Grab all 151 images

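# The original x argument was never used, so it is dropped here. The function
# returns every image URL in the page's content area, in page order.
get_img_url <- function(url = bulbagarden) {
  read_html(url) %>%
    html_nodes(css = "#mw-content-text img") %>%
    as.character() %>%                       # serialise each <img> tag to text
    str_split_fixed("src=\"", n = 2) %>%     # cut everything up to src="
    .[, 2] %>%
    str_split_fixed("\" width=", n = 2) %>%  # cut everything after the URL
    .[, 1] %>%
    paste0("https:", .)                      # prefix the scheme
}

# Relies on the page holding one image per table row, in the same order
baseStats %>%
  mutate(
    img_url = get_img_url()
    ) -> baseStats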
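
The extraction step can be checked without touching the network. A minimal sketch, assuming a made-up fragment (the example.org host and file names are placeholders) whose src values are protocol-relative: rvest::html_attr() reads the src attribute directly, which is an alternative to splitting the tag text and does not depend on the order of attributes inside the tag.

```r
library(rvest)

# Hypothetical fragment; the host and paths are invented for illustration.
fragment <- read_html('
  <html><body>
    <div id="mw-content-text">
      <img alt="Bulbasaur" src="//example.org/img/001.png" width="40" height="40">
      <img alt="Ivysaur" src="//example.org/img/002.png" width="40" height="40">
    </div>
  </body></html>
')

# Pull the src attribute from every matching <img> node
srcs <- fragment %>%
  html_nodes(css = "#mw-content-text img") %>%
  html_attr("src")

# Protocol-relative URLs start with "//", so prefix the scheme
urls <- paste0("https:", srcs)
```

Comparing length(urls) with nrow(baseStats) before the mutate() call would catch a page that carries more or fewer images than table rows.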

Congratulations, you caught all 151 image urls!

baseStats[1:3]
##     DexNo    Pokemon  HP
## 1       1  Bulbasaur  45
## 2       2    Ivysaur  60
## 3       3   Venusaur  80
## 4       4 Charmander  39
## 5       5 Charmeleon  58
## 6       6  Charizard  78
## 7       7   Squirtle  44
## 8       8  Wartortle  59
## 9       9  Blastoise  79
## 10     10   Caterpie  45
## 11     11    Metapod  50
## 12     12 Butterfree  60
## 13     13     Weedle  40
## 14     14     Kakuna  45
## 15     15   Beedrill  65
## 16     16     Pidgey  40
## 17     17  Pidgeotto  63
## 18     18    Pidgeot  83
## 19     19    Rattata  30
## 20     20   Raticate  55
## 21     21    Spearow  40
## 22     22     Fearow  65
## 23     23      Ekans  35
## 24     24      Arbok  60
## 25     25    Pikachu  35
## 26     26     Raichu  60
## 27     27  Sandshrew  50
## 28     28  Sandslash  75
## 29     29   Nidoran♀  55
## 30     30   Nidorina  70
## 31     31  Nidoqueen  90
## 32     32   Nidoran♂  46
## 33     33   Nidorino  61
## 34     34   Nidoking  81
## 35     35   Clefairy  70
## 36     36   Clefable  95
## 37     37     Vulpix  38
## 38     38  Ninetales  73
## 39     39 Jigglypuff 115
## 40     40 Wigglytuff 140
## 41     41      Zubat  40
## 42     42     Golbat  75
## 43     43     Oddish  45
## 44     44      Gloom  60
## 45     45  Vileplume  75
## 46     46      Paras  35
## 47     47   Parasect  60
## 48     48    Venonat  60
## 49     49   Venomoth  70
## 50     50    Diglett  10
## 51     51    Dugtrio  35
## 52     52     Meowth  40
## 53     53    Persian  65
## 54     54    Psyduck  50
## 55     55    Golduck  80
## 56     56     Mankey  40
## 57     57   Primeape  65
## 58     58  Growlithe  55
## 59     59   Arcanine  90
## 60     60    Poliwag  40
## 61     61  Poliwhirl  65
## 62     62  Poliwrath  90
## 63     63       Abra  25
## 64     64    Kadabra  40
## 65     65   Alakazam  55
## 66     66     Machop  70
## 67     67    Machoke  80
## 68     68    Machamp  90
## 69     69 Bellsprout  50
## 70     70 Weepinbell  65
## 71     71 Victreebel  80
## 72     72  Tentacool  40
## 73     73 Tentacruel  80
## 74     74    Geodude  40
## 75     75   Graveler  55
## 76     76      Golem  80
## 77     77     Ponyta  50
## 78     78   Rapidash  65
## 79     79   Slowpoke  90
## 80     80    Slowbro  95
## 81     81  Magnemite  25
## 82     82   Magneton  50
## 83     83 Farfetch'd  52
## 84     84      Doduo  35
## 85     85     Dodrio  60
## 86     86       Seel  65
## 87     87    Dewgong  90
## 88     88     Grimer  80
## 89     89        Muk 105
## 90     90   Shellder  30
## 91     91   Cloyster  50
## 92     92     Gastly  30
## 93     93    Haunter  45
## 94     94     Gengar  60
## 95     95       Onix  35
## 96     96    Drowzee  60
## 97     97      Hypno  85
## 98     98     Krabby  30
## 99     99    Kingler  55
## 100   100    Voltorb  40
## 101   101  Electrode  60
## 102   102  Exeggcute  60
## 103   103  Exeggutor  95
## 104   104     Cubone  50
## 105   105    Marowak  60
## 106   106  Hitmonlee  50
## 107   107 Hitmonchan  50
## 108   108  Lickitung  90
## 109   109    Koffing  40
## 110   110    Weezing  65
## 111   111    Rhyhorn  80
## 112   112     Rhydon 105
## 113   113    Chansey 250
## 114   114    Tangela  65
## 115   115 Kangaskhan 105
## 116   116     Horsea  30
## 117   117     Seadra  55
## 118   118    Goldeen  45
## 119   119    Seaking  80
## 120   120     Staryu  30
## 121   121    Starmie  60
## 122   122   Mr. Mime  40
## 123   123    Scyther  70
## 124   124       Jynx  65
## 125   125 Electabuzz  65
## 126   126     Magmar  65
## 127   127     Pinsir  65
## 128   128     Tauros  75
## 129   129   Magikarp  20
## 130   130   Gyarados  95
## 131   131     Lapras 130
## 132   132      Ditto  48
## 133   133      Eevee  55
## 134   134   Vaporeon 130
## 135   135    Jolteon  65
## 136   136    Flareon  65
## 137   137    Porygon  65
## 138   138    Omanyte  35
## 139   139    Omastar  70
## 140   140     Kabuto  30
## 141   141   Kabutops  60
## 142   142 Aerodactyl  80
## 143   143    Snorlax 160
## 144   144   Articuno  90
## 145   145     Zapdos  90
## 146   146    Moltres  90
## 147   147    Dratini  41
## 148   148  Dragonair  61
## 149   149  Dragonite  91
## 150   150     Mewtwo 106
## 151   151        Mew 100

Your Pokemon have been moved to “Someone’s PC”!

dir.create("./someonesPC/")

for(url in baseStats$img_url) {
  download.file(
    url,
    destfile = paste0(
      "./someonesPC/",
      baseStats[baseStats$img_url == url,]$DexNo,
      ".png"), 
    mode = "wb"
    )
}
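
The loop above finds each row by matching on the URL. A minimal sketch of an alternative, assuming a toy stand-in for baseStats with placeholder URLs: build the destination paths once, then loop by row index so each URL is paired with its own file name. The run_download flag is a hypothetical guard so the sketch does not hit the network unless switched on.

```r
# Toy stand-in for baseStats; the URLs are placeholders, not real image links.
toy <- data.frame(
  DexNo = 1:3,
  img_url = paste0("https://example.org/img/00", 1:3, ".png"),
  stringsAsFactors = FALSE
)

out_dir <- "someonesPC"

# One destination path per row, named after the Pokedex number
toy$dest <- file.path(out_dir, paste0(toy$DexNo, ".png"))

# Hypothetical guard: set to TRUE to actually download the files
run_download <- FALSE
if (run_download) {
  dir.create(out_dir, showWarnings = FALSE)
  for (i in seq_len(nrow(toy))) {
    # mode = "wb" keeps binary files such as .png intact on Windows
    download.file(toy$img_url[i], destfile = toy$dest[i], mode = "wb")
  }
}
```

Indexing by row also behaves sensibly if two rows ever share the same image URL, where matching on the URL would return more than one DexNo.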

DaveRGP checked “Someone’s PC”!

someones PC

Masterballin’ - Rbokeh URL plotting

I’m gonna be the very best…

library(rbokeh)

P <- figure(title = "Pokemon by Total Stats in Pokedex Order") %>% 
  ly_image_url(
    data = baseStats,
    x = DexNo,
    y = Total,
    w = 10,
    h = 20,
    image_url = img_url,
    anchor = "center"
  )

…like no one ever was!

Trainer Tips

ProgramRmons can…

  • Write R functions
  • Write base R for loops
  • Use web scraping functions from packages
    • Use xml2::read_html() to read the whole page into memory
    • Use rvest::html_node() to find individual parts of the page
    • Use rvest::html_nodes() (plural) to return multiple items
    • Use rvest::html_table() to return a data.frame
  • Use stringr package for manipulating urls
  • Use rbokeh package to make interactive plots with .pngs sourced by url

squirtles

Trainer Card

David Parr

github: DaveRGP

this project: https://github.com/DaveRGP/GotRCatchEmAll

this presentation: https://rpubs.com/DaveRGP/GotRCatchEmAll